ABSTRACT

In  speech  recognition  applications,  it  is  often  desirable  to 
make a gross characterization of the shape of the spectrum of
a particular sound. The autocorrelation method of linear
prediction analysis leads to an all-pole approximation to the
signal  spectrum.  Hence  an  LPC  analysis  using  two  poles 
produces  one  possible  gross  characterization.  The  two  poles 
are  computed  as  the  roots  of  a  quadratic  equation  whose 
coefficients  are  the  linear  prediction  parameters,  which  are 
simple functions of the autocorrelation coefficients R0, R1, and
R2. The poles are either both real or form a conjugate pair in
the  z  plane.  This  fact,  together  with  the  exact  positions  of 
the  poles,  is  particularly  useful  in  describing  certain  gross 
characteristics  of  the  spectrum.  The  spectral  dynamic  range  of 
the  two-pole  spectrum  and  the  normalized  minimum  error  are 
suggested  as  more  suitable  substitutes  for  the  two-pole  bandwidths 
in  interpreting  the  information  supplied  by  the  model  for  the 
purpose  of  spectral  characterization. 
I. INTRODUCTION

In  the  analysis  of  speech  signals  it  is  often  desirable 
to  make  gross  characterizations  of  speech  spectra.  This  is 
useful in speech recognition for the purposes of segmentation
as well as the general classification of the different sounds.

In  the  past,  gross  spectral  characterizations  have  been 
obtained  by  computing  parameters  that  depended  on  the  energy 
contained  in  different  regions  of  the  spectrum.  Other  methods 
have  employed  measurements  of  zero  crossing  rates  and  zero 
crossing  distances.  In  this  paper  we  describe  a  method  for  the 
gross  characterization  of  speech  spectra  using  a  simple  linear 
prediction  model. 

It  is  well  known  that  in  linear  prediction  the  signal 
spectrum is modeled or approximated by an all-pole spectrum
[1,2].  The  number  of  poles  in  the  approximate  spectrum  is 
arbitrary  and  is  set  to  different  values  depending  on  the 
sampling  frequency  of  the  signal  and  on  the  particular 
application.  For  example,  for  a  10  kHz  sampled  signal,  a  14-pole 
model  is  now  common  for  the  purposes  of  spectral  envelope 
estimation  and  formant  extraction.  However,  if  we  assume  that 
our  gross  characterization  is  to  consist  in  the  poles  themselves, 
then  a  14-pole  model  contains  too  much  information,  and  it  takes 
a  relatively  long  time  to  compute,  since  it  involves  finding  the 
roots of a 14th degree polynomial. We have found that a two-pole
model  is  optimal  in  terms  of  three  things: 

(1)  ease  of  computation, 

(2)  adequacy  of  representation, 

(3)  ease  of  interpretation. 

These three points are discussed in the following three sections.

II.  TWO-POLE  MODEL 


The transfer function of the two-pole model is given by

$$H(z) = \frac{A}{1 - a_1 z^{-1} - a_2 z^{-2}} \tag{1}$$

where a1 and a2 are the predictor coefficients, and A is a gain
factor.

The coefficients a1 and a2 can be computed using either
the autocorrelation or covariance method of linear prediction [1].
Although much of the discussion in this paper also applies to the
covariance method, we shall work exclusively with the
autocorrelation method. In the latter method, a1 and a2 are
solutions to the two equations

$$a_1 R_0 + a_2 R_1 = R_1 \tag{2a}$$

$$a_1 R_1 + a_2 R_0 = R_2 \tag{2b}$$

where Ri is the ith autocorrelation coefficient of the signal.


The solution of (2) gives:

$$a_1 = \frac{r_1 (1 - r_2)}{1 - r_1^2}, \qquad a_2 = \frac{r_2 - r_1^2}{1 - r_1^2}$$

where

$$r_i = \frac{R_i}{R_0}, \qquad i = 0, 1, 2$$

are the normalized autocorrelation coefficients with the
property that |ri| ≤ 1. The gain factor A can be shown to be equal to

$$A = \sqrt{V}$$

where

$$V = 1 - a_1 r_1 - a_2 r_2 \tag{6}$$

is the normalized minimum error [1].
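The closed-form solution of (2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program; the function name two_pole_lpc and its error handling are assumptions made here.

```python
def two_pole_lpc(R0, R1, R2):
    """Return (a1, a2, V) for a two-pole autocorrelation-method fit.

    R0, R1, R2 are the first three autocorrelation coefficients.
    The name and argument order are illustrative assumptions.
    """
    if R0 <= 0.0:
        raise ValueError("R0 must be positive")
    # Normalized autocorrelation coefficients r1 and r2 (r0 = 1).
    r1 = R1 / R0
    r2 = R2 / R0
    denom = 1.0 - r1 * r1
    if denom <= 0.0:
        raise ValueError("|r1| must be strictly less than 1")
    # Closed-form solution of the two normal equations (2a) and (2b).
    a1 = r1 * (1.0 - r2) / denom
    a2 = (r2 - r1 * r1) / denom
    # Normalized minimum error, equation (6).
    V = 1.0 - a1 * r1 - a2 * r2
    return a1, a2, V
```

For instance, R0 = 1, R1 = 0.5, R2 = 0 gives a1 = 2/3, a2 = -1/3 and V = 2/3, and substituting back into (2a) and (2b) reproduces R1 and R2.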


The poles of H(z) in the z-plane are simply the roots of
the quadratic polynomial $1 - a_1 z^{-1} - a_2 z^{-2}$ in the denominator of (1):

$$z_{1,2} = \frac{a_1}{2} \pm \sqrt{\frac{a_1^2}{4} + a_2} \tag{7}$$

Depending on the values of a1 and a2, the poles z1 and z2 are
either both real or form a complex conjugate pair. Conversion of
the poles to the s-plane is accomplished by setting

$$z = e^{sT} = e^{(\sigma + j\omega)T} = e^{2\pi T(-h + jf)} \tag{8}$$

where T is the sampling interval, f is the frequency of the pole,
and h is defined to be the half-bandwidth of the pole.


If a pole is at $z = z_r + j z_i$, then:

$$f = \frac{f_s}{2\pi} \arctan\frac{z_i}{z_r} \tag{9}$$

$$h = -\frac{f_s}{4\pi} \ln\left(z_r^2 + z_i^2\right) \tag{10}$$

where $f_s = 1/T$ is the sampling frequency.


This  completes  the  specification  of  the  two  poles.  As  can 
be  seen  from  the  above,  the  computations  are  straightforward. 

Note  that  if  the  model  had  more  than  two  poles,  one  would  have 
to  find  the  roots  of  a  polynomial  of  degree  greater  than  2,  which 
is  not  a  straightforward  task. 
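The pole computation of this section can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function names are invented here, and atan2 with an absolute value is used in place of a plain arctangent so that a negative real pole is reported at half the sampling frequency and a conjugate pair is reported with a single non-negative frequency.

```python
import cmath
import math


def two_pole_roots(a1, a2):
    """Roots of z^2 - a1*z - a2 = 0, following equation (7)."""
    root = cmath.sqrt(a1 * a1 / 4.0 + a2)
    return a1 / 2.0 + root, a1 / 2.0 - root


def pole_frequency_halfbandwidth(z, fs):
    """Frequency f and half-bandwidth h, both in Hz, of a z-plane pole.

    Assumes z is nonzero (a pole at the origin has no finite bandwidth).
    """
    # Pole angle mapped to Hz; a negative real pole gives fs / 2.
    f = fs / (2.0 * math.pi) * abs(math.atan2(z.imag, z.real))
    # Half-bandwidth from the squared pole radius; positive inside
    # the unit circle.
    h = -fs / (4.0 * math.pi) * math.log(z.real ** 2 + z.imag ** 2)
    return f, h
```

As a check, coefficients built from a pole of radius 0.9 at 500 Hz with a 10 kHz sampling rate (a1 = 2(0.9)cos(2π·500/10000), a2 = -0.81) return f = 500 Hz from this sketch.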


III. ADEQUACY OF REPRESENTATION


In  this  section  we  show  that  the  two-pole  model  is  adequate 
for  representing  gross  characterizations  of  speech  spectra. 

The possible positions for the two poles z1 and z2 form
four distinct cases. Figure 1 shows the four possible
prototype amplitude responses for the two-pole model. Each
amplitude response is computed along the unit circle from z=1 to
z=-1, which corresponds to a plot from zero frequency to half
the sampling frequency. The first case is that of the familiar
complex conjugate pair. The amplitude response is completely
specified by the frequency and bandwidth of one pole. For the
case of real poles, there are three possibilities. The poles can
be either both positive, both negative, or one positive and one
negative. A positive real pole (in the z-plane) corresponds to
a pole at zero frequency and indicates energy concentration at
low frequencies. A negative real pole corresponds to a pole at
half the sampling frequency and indicates energy concentration
at high frequencies.

FIGURE 1: The four possible configurations of the two-pole
model and representative spectra. (A: complex conjugate pair;
B: real poles. For each configuration the z-plane poles and the
amplitude response from 0 to fs/2 are shown.)
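The four cases described above can be told apart from the predictor coefficients alone, since the two poles satisfy z1 + z2 = a1 and z1 z2 = -a2. The following Python sketch illustrates this; the function name and the returned labels are assumptions made for this example, and poles lying exactly at z = 0 fall on a boundary between cases.

```python
def classify_two_pole(a1, a2):
    """Name the pole configuration implied by the coefficients a1, a2."""
    # The discriminant of z^2 - a1*z - a2 decides real versus complex.
    if a1 * a1 / 4.0 + a2 < 0.0:
        return "complex conjugate pair"
    # Real poles: the product z1*z2 = -a2 is negative when a2 > 0,
    # so the poles then have opposite signs.
    if a2 > 0.0:
        return "one positive, one negative"
    # Same-sign real poles share the sign of their sum a1.
    if a1 > 0.0:
        return "both positive"
    return "both negative"
```

For example, poles at 0.8 and 0.5 correspond to a1 = 1.3, a2 = -0.4 and are labelled "both positive", while poles at 0.8 and -0.3 correspond to a1 = 0.5, a2 = 0.24 and are labelled "one positive, one negative".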

All four prototype cases shown in Figure 1 do occur when
modeling speech spectra. To give a flavor of how and
when these four cases occur, we present a few examples in
Figures 2-6. In each of the examples, the speech signal was
low-pass filtered at 4.5 kHz and sampled at 10 kHz. The two-pole
spectrum (i.e. $|H(f)|^2$) is shown superimposed over the actual
speech spectrum being modeled. In each case, the corresponding
speech sound is shown along with the pole parameters: f represents
the pole frequency and h the pole half-bandwidth in Hz.
For example, Figure 2b shows the two-pole model for an example
of the sound [t]. The model has a pair of conjugate poles at
473 Hz with a half-bandwidth of 57 Hz. Figures 2 and 3 show
examples where the spectrum is modeled by complex conjugate
pole pairs. Figure 3b corresponds to the spectrum of the burst
in the plosive [p]. Note that by simply changing the frequency
and bandwidth of the conjugate pair of poles, many different
spectral shapes can be accommodated.

FIGURE 3: Examples of consonant spectra modelled by complex
conjugate pairs of poles.

Figure  4  shows  two  examples  where  the  spectrum  is  modeled 
by  two  positive  poles  in  the  z-plane,  i.e.  both  poles  are  at 
zero  frequency.  In  Figure  4b,  the  relatively  high  energy 
at  both  low  and  mid  frequencies  resulted  in  a  model  with  two 
positive  real  poles,  instead  of  a  complex  conjugate  pair,  which 
is  more  common  for  vowels. 

Figure  5  shows  two  examples  where  the  spectrum  is  modeled 
by  one  positive  and  one  negative  pole  in  the  z-plane.  Figure  5a 
corresponds  to  a  vowel-fricative  transition  while  Figure  5b 
corresponds  to  a  voiced  fricative.  In  both  cases  there  is 
energy  concentration  both  at  low  and  high  frequencies. 

Finally,  Figure  6  shows  an  example  where  the  spectrum  is 
modeled  by  two  negative  poles,  i.e.  both  poles  are  at  half  the 
sampling  frequency  (5  kHz). 

The above examples give a good indication of the adequacy
of  representation  of  the  two-pole  model  for  the  gross 
characterization  of  speech  spectra.  Below  we  discuss  how  one 
interprets  results  of  a  two-pole  model  for  segmentation  and 
broad  classification. 
IV. SEGMENTATION AND BROAD CLASSIFICATION

Using  the  two-pole  model  in  the  recognition  of  continuous 
speech  suggests  that  the  spectral  characterization  described 
above  be  performed  at  regular  closely  spaced  points  throughout 
the  utterance,  producing  a  multi-parametric  description  of  the 
signal. The two-pole model for each point can be represented
by the frequencies and bandwidths of the two poles. This type
of representation is reasonable for complex conjugate poles
since there is only one frequency and one bandwidth to interpret.
The  frequency  indicates  the  position  of  the  main  region  of 
energy  concentration,  and  the  bandwidth  indicates  the  spread  of 
energy  in  that  region.  However,  in  the  case  of  real  poles, 
we  have  to  deal  with  two  possibly  distinct  frequencies  and  two 
bandwidths.  The  frequencies  are  always  either  zero  or  equal  to 
half  the  sampling  frequency,  and  are  easily  interpretable,  as 
shown  below.  On  the  other  hand,  interpretation  of  two  distinct 
bandwidths  is  far  from  straightforward,  especially  when  the  two 
frequencies  are  identical. 

We  have  found  that  the  bandwidth  information  can  be 
represented  in  a  more  helpful  manner  in  terms  of  the  dynamic 
range  of  the  two-pole  spectrum  and  the  direction  or  sign  of  its 
"slope".  We  define  the  spectral  dynamic  range  to  be  the 
difference in decibels between the highest and lowest amplitude
points  on  the  two-pole  spectrum.  The  slope  of  the  two-pole 
spectrum  is  either  positive  or  negative:  It  is  positive  if  the 
energy  is  concentrated  above  the  midpoint  of  the  spectrum 
(2.5  kHz  in  our  case)  and  negative  otherwise. 
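The two quantities just defined can also be estimated numerically by sampling the two-pole spectrum on a grid from zero frequency to half the sampling frequency. The Python sketch below is only an illustration: the function name and grid size are assumptions, and the sign of the slope is taken from the position of the spectral peak relative to the midpoint, which is a stand-in chosen here for "where the energy is concentrated".

```python
import math


def dynamic_range_and_slope(a1, a2, n=2048):
    """Return (D in dB, S = +1 or -1) for the two-pole spectrum.

    The gain factor cancels in D, so it is omitted.
    """
    power = []
    for k in range(n + 1):
        theta = math.pi * k / n  # 0 .. pi, i.e. 0 .. fs/2
        # Denominator 1 - a1*e^{-j theta} - a2*e^{-2j theta}.
        re = 1.0 - a1 * math.cos(theta) - a2 * math.cos(2.0 * theta)
        im = a1 * math.sin(theta) + a2 * math.sin(2.0 * theta)
        power.append(1.0 / (re * re + im * im))
    # Difference in dB between the highest and lowest spectral points.
    D = 10.0 * math.log10(max(power) / min(power))
    # Assumed slope rule: positive if the peak lies above the midpoint.
    k_max = max(range(n + 1), key=lambda k: power[k])
    S = 1 if k_max > n / 2 else -1
    return D, S
```

With two positive real poles at 0.8 and 0.5 (a1 = 1.3, a2 = -0.4) the extremes fall at the two ends of the grid, giving D = 20 log10(27), about 28.6 dB, and a negative slope.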

From  (1)  and  (7),  it  is  simple  to  derive  formulas  for  the 
two-pole  dynamic  range  D  and  the  sign  S  of  the  two-pole  slope. 
There  are  four  distinct  cases. 


[Equations (11a) through (11e) are not legible in this copy. The surviving fragment of the last case follows.]

If sign(a) ≠ sign(b), let A = [...], B = [...]; then S = -sign(a) and

$$D = 10 \log_{10} \frac{A(1+B^2) + B(1+A^2)}{(1-A)(1+B)} \tag{11g}$$

It  should  be  clear  from  the  above  that,  in  the  case  of 
complex conjugate poles, the spectral dynamic range D uniquely
specifies  the  bandwidth  of  the  poles.  For  real  poles,  the 
spectral  dynamic  range  is  an  intuitive  substitute  for  bandwidth 
information  but  does  not  specify  it  uniquely.  The  sign  of  the 
spectral  slope  gives  additional  useful  information  only  when 
the  two  poles  are  real  with  one  pole  at  zero  frequency  and  the 
other  at  half  the  sampling  frequency. 
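For real poles the dynamic range can be worked out directly from (1). The following is a sketch of one such derivation, offered as an independent worked example and not as a reconstruction of equations (11); the labelling of the pole magnitudes as A (larger) and B (smaller) is an assumption made here, and the gain factor is dropped because it cancels in D.

```latex
% Real poles a and b; c = \cos(2\pi f / f_s). Up to the constant gain,
% |H|^2 = 1 / Q(c), with
Q(c) = (1 - 2ac + a^2)(1 - 2bc + b^2), \qquad -1 \le c \le 1 .

% Poles of the same sign: Q is monotonic in c, so the extremes of the
% spectrum lie at f = 0 and f = f_s / 2, and
D = 20 \log_{10} \frac{(1+|a|)(1+|b|)}{(1-|a|)(1-|b|)} .

% Poles of opposite sign, A = \max(|a|,|b|), B = \min(|a|,|b|):
% Q is concave in c, so the spectral peak is at the end nearest the
% larger pole, and the spectral minimum is at the vertex
|c_0| = \frac{(A-B)(1-AB)}{4AB} ,
% on the side of the smaller pole. If |c_0| \le 1:
D = 20 \log_{10} \frac{(A+B)(1+AB)}{2\sqrt{AB}\,(1-A)(1+B)} ,
% otherwise the minimum is at the far end of the band and
D = 20 \log_{10} \frac{(1+A)(1-B)}{(1-A)(1+B)} .
```

Note that (A+B)(1+AB) = A(1+B^2) + B(1+A^2). As a numerical check, A = 0.8 and B = 0.3 give |c0| of about 0.40 and D of about 14.6 dB.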

The  behavior  of  the  two-pole  model  when  applied  at  regular 
intervals  to  an  utterance  is  shown  in  Figure  7.  The  utterance  is 
"Has  anyone  measured  nickel  concentrations..."  The  two-pole 
analysis  was  performed  at  10  msec  intervals  over  20  msec 
Hamming-windowed  segments  of  the  waveform.  The  pole  frequencies 
are  plotted  as  a  single  point  where  they  are  identical,  and  as 
two  points  for  those  frames  where  one  pole  is  at  zero  and  one  is 
at  5000  Hz.  Note  that  the  scale  of  the  frequency  plot  is  linear 
only  from  0  to  500  Hz,  then  logarithmic  to  5000  Hz.  Between  the 
pole  frequency  and  dynamic  range  plots,  the  frames  in  which  the 
two-pole  slope  is  positive  are  indicated. 
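The frame-by-frame analysis described above (20 msec Hamming-windowed segments taken every 10 msec) needs only R0, R1 and R2 for each frame. A minimal Python sketch of that front end follows; the function name, the plain-list signal representation, and the particular Hamming window formula are assumptions of this example, not details taken from the paper.

```python
import math


def frame_autocorrelations(signal, fs, frame_ms=20.0, hop_ms=10.0):
    """Return a list of (R0, R1, R2) tuples, one per analysis frame."""
    n = int(round(fs * frame_ms / 1000.0))   # samples per frame
    hop = int(round(fs * hop_ms / 1000.0))   # samples between frames
    # Standard Hamming window of length n (assumed form).
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1))
              for k in range(n)]
    frames = []
    for start in range(0, len(signal) - n + 1, hop):
        seg = [signal[start + k] * window[k] for k in range(n)]
        # Short-time autocorrelation at lags 0, 1 and 2.
        R = [sum(seg[k] * seg[k + lag] for k in range(n - lag))
             for lag in (0, 1, 2)]
        frames.append((R[0], R[1], R[2]))
    return frames
```

At a 10 kHz sampling rate each frame holds 200 samples and successive frames start 100 samples apart, so a 1000-sample signal yields 9 frames.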
FIGURE 7: Two-pole frequency, "slope", and spectral dynamic
range at 10 msec intervals in the utterance "Has anyone
measured nickel concentrations..." (Horizontal axis: time in
seconds; frames with positive two-pole slope are marked.)
Many segment boundaries, particularly those where a change
of  manner  of  articulation  takes  place,  are  clearly  marked  by 
abrupt  or  rapid  changes  in  the  two-pole  frequencies,  which 
often  switch  from  one  of  the  four  prototype  models  to  another. 
Every  such  switch  is  a  marker  of  spectral  change,  but  not  every 
one  will  mark  a  segment  boundary.  For  example,  at  both  the 
beginning and end of the [z] at t=0.08 to 0.14 seconds, there
is  a  frame  in  which  both  poles  are  at  0,  while  the  middle  of 
the  [z]  has  a  string  of  frames  with  one  pole  at  0  and  one  at 
5000  Hz.  This  is  a  common  pattern  of  transition  to  and  from 
a  fricative.  Of  course,  plosives,  particularly  unvoiced  ones, 
usually show transitions corresponding to the burst-aspiration-phonation
sequence, as would be expected. See the two examples
of  [k]  around  t=0.95  and  t=1.10  and  the  [t]  example  around  t=1.50. 

Many,  but  not  all,  sonorant  sequence  transitions  exhibit 
a  rapid  change  in  the  two-pole  frequency,  which  tends  to  follow 
the first formant if it is dominant, or lies between F1 and F2
if F2 is close enough to F1. See for example t=0.35, t=0.50,
and  t=0.85. 

Not all segments with conjugate poles are sonorants, and
vice versa. For example, [i] around t=0.25 is modeled with two
real poles, in the way also illustrated by Figure 4b. It is also
common for nasals to be modeled by two real poles, as the [n]
around t=0.45, or very low-frequency conjugate poles, as the
[m] immediately following.

Any  occurrence  of  a  pole  at  5000  Hz  indicates  strident 
frication,  or  an  equivalent  burst,  as  do  conjugate  poles  above 
about 1 kHz. Most examples of [s], [š], and [z] will show
this during at least some of their extent.

The  two-pole  dynamic  range  is  quite  high  during  nasals, 
because  of  the  dominance  of  the  low  first  formant.  This  is 
quite  a  reliable  indication.  Conversely  the  dynamic  range  is 
usually  quite  low  during  unvoiced  fricatives.  The  measure  is 
not  quite  as  reliable  during  voiced  fricatives.  A  positive 
two-pole  slope  is,  of  course,  a  strong  indication  of  strident 
frication. 

The  gross  characterizations  of  speech  spectra  given  by 
the  two-pole  model  are  certainly  not  sufficient  in  and  of 
themselves  to  segment  and  roughly  label  continuous  speech,  but 
they do point to a large proportion of segment boundaries.
Together  with  other  obvious  measurements  such  as  energy  and 
voicing,  they  form  a  powerful  combination  for  the  initial 
stages  of  speech  recognition. 
V. AN ALTERNATIVE MEASURE TO THE SPECTRAL DYNAMIC RANGE

The two-pole dynamic range is a rather intuitive measure
of  one  aspect  of  spectral  shape,  that  is,  it  is  easily 
visualized  from  a  graph  of  the  spectrum.  A  clearly  related 
(but  easier  to  compute)  measure  is  the  normalized  minimum  error 
V,  given  by  (6).  It  can  be  shown  that  the  measure  V  is  equal 
to  the  ratio  of  the  geometric  mean  of  the  two-pole  spectrum 
to its arithmetic mean (see [1], pp. 109-115). It has been
known  for  some  time  that  the  ratio  of  the  geometric  mean  to  the 
arithmetic  mean  is  a  good  measure  of  the  spread  of  the  data. 

For  smooth  spectra  (as  is  the  case  for  a  two-pole  spectrum)  the 
spectral  dynamic  range  is  also  a  good  measure  of  the  spread  of 
the  spectrum.  It  is  not  surprising,  therefore,  that  the  two 
measures  should  behave  in  a  similar  fashion.  This  similarity 
is  illustrated  in  Figure  8,  which  shows  200  values  of  V  versus 
D  for  the  two  seconds  of  continuous  speech  shown  in  Figure  7. 
The continuous curve also plotted in Figure 8 is that of Vm, the
absolute lower bound on V for each value of D (see [1], pp. 116-
120). The data points themselves fall within a very well
defined region, suggesting for two-pole spectra a tighter
lower bound (and also an upper bound) than the Vm versus D
curve.

FIGURE 8: Two-pole normalized error vs. spectral dynamic range
for the 200 data points in the utterance in Figure 7. The solid
curve is Vm, the absolute lower bound on the normalized error.
Since the normalized error V is somewhat easier to compute
than  the  spectral  dynamic  range  D,  and  since  it  leads  to  very 
similar  results,  our  suggestion  here  is  that  it  might  be 
preferable  to  use  V  in  actual  implementations. 
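The statement that V equals the ratio of the geometric mean to the arithmetic mean of the two-pole spectrum can be checked numerically. The Python sketch below does so by sampling the spectrum at equally spaced midpoints between zero and half the sampling frequency; the function name and grid size are assumptions of this example, and the gain is omitted because it cancels in the ratio.

```python
import math


def gm_am_ratio(a1, a2, n=4096):
    """Geometric-to-arithmetic mean ratio of the two-pole spectrum."""
    log_sum = 0.0
    lin_sum = 0.0
    for k in range(n):
        theta = math.pi * (k + 0.5) / n  # midpoints on (0, pi)
        # Squared magnitude of 1 - a1*e^{-j theta} - a2*e^{-2j theta}.
        re = 1.0 - a1 * math.cos(theta) - a2 * math.cos(2.0 * theta)
        im = a1 * math.sin(theta) + a2 * math.sin(2.0 * theta)
        den = re * re + im * im
        log_sum += -math.log(den)   # log of the spectrum 1 / den
        lin_sum += 1.0 / den
    geometric_mean = math.exp(log_sum / n)
    arithmetic_mean = lin_sum / n
    return geometric_mean / arithmetic_mean
```

For the coefficients a1 = 2/3, a2 = -1/3 (obtained from r1 = 0.5, r2 = 0), equation (6) gives V = 2/3, and the sampled ratio agrees with that value to within the accuracy of the grid.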